Latin Etymologies as Features on BNC Text Categorization
Abstract
This paper presents early experimental work on text categorization (TC) of the British National Corpus (BNC) using Latin etymologies as features, with an emphasis on spoken and written texts. The study has two aims: (1) to explore new, discriminative linguistic features as an alternative to the noisy “bag-of-words” (BoW) representation; and (2) to lay the groundwork for representing texts with distinct types of linguistic features under different weighting schemes, rather than as plain BoW feature vectors. The experiments reveal notably different distribution patterns of Latin etymologies in spoken and written BNC texts. A home-made classifier based on the probability distribution ranges of Latin etymologies reaches a precision of 72.31% and a recall of 73.22% on BNC spoken texts, and a precision of 73.31% and a recall of 69.98% on BNC written texts.
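The abstract does not specify how the classifier is built, so the following is only a minimal sketch of one way a range-based classifier over Latin-etymology proportions could look. The toy lexicon `LATIN_ORIGIN`, the function names, and the nearest-range decision rule are all assumptions made for illustration and are not taken from the paper. The last function shows the standard per-class precision and recall computation behind figures like those reported above.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

# Hypothetical toy lexicon of Latin-derived words (assumption for illustration).
# A real study would draw on an etymological dictionary instead.
LATIN_ORIGIN: Set[str] = {
    "data", "formula", "describe", "construct", "include", "information",
}


def latin_ratio(tokens: Sequence[str], lexicon: Set[str] = LATIN_ORIGIN) -> float:
    """Share of tokens found in the Latin-origin lexicon (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(t.lower() in lexicon for t in tokens) / len(tokens)


def fit_ranges(
    labelled: Iterable[Tuple[str, Sequence[str]]]
) -> Dict[str, Tuple[float, float]]:
    """Record the (min, max) Latin ratio observed for each class label."""
    ratios: Dict[str, List[float]] = {}
    for label, tokens in labelled:
        ratios.setdefault(label, []).append(latin_ratio(tokens))
    return {label: (min(v), max(v)) for label, v in ratios.items()}


def classify(tokens: Sequence[str], ranges: Dict[str, Tuple[float, float]]) -> str:
    """Pick the class whose range contains, or lies closest to, the text's ratio.

    Ties (e.g. overlapping ranges) go to the label seen first during fitting.
    """
    r = latin_ratio(tokens)

    def distance(bounds: Tuple[float, float]) -> float:
        lo, hi = bounds
        return 0.0 if lo <= r <= hi else min(abs(r - lo), abs(r - hi))

    return min(ranges, key=lambda label: distance(ranges[label]))


def precision_recall(
    gold: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float]:
    """Per-class precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Tiny made-up training set, for demonstration only.
train = [
    ("spoken", "well you know i think so".split()),
    ("spoken", "yeah include that one".split()),
    ("written", "the data include information".split()),
    ("written", "we describe the formula and construct data".split()),
]
ranges = fit_ranges(train)
```

Calling `classify("information and data".split(), ranges)` then assigns the text to whichever class range its Latin-word share falls nearest to. A sketch like this ignores overlap between the class ranges, which any real classifier of this kind would need to handle.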
Similar resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods for searching, filtering and managing resources are in high demand. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
Corpus Linguistics with BNCweb - a Practical Guide
Book synopsis: This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports soph...
Fuck revisited
This paper is a follow-up to the investigation of McEnery, Baker and Hardie (2000) into the use of the word fuck in spoken British English. Both that paper and this one are based on the British National Corpus. However, at the time of writing in 2000, the analysis of fuck in the written BNC had not been completed, hence the 2000 paper focussed on spoken English alone. In doing so, it explored the w...
Web as Corpus
The corpus resource for the 1990s was the BNC. Conceived in the 80s, completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography. But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, fo...
The Creation of a Spoken Sub-Corpus from the British National Corpus for Comparative Purposes
The British National Corpus (henceforth BNC) is one of the most frequently consulted corpora in linguistic research. While the use of this corpus is continuously on the increase, it appears that most BNC-related research work has exploited the corpus in its entirety, i.e. taking the corpus as a whole in analysing specific features or comparing with a different reference corpus. Despite the fact...